test/integration: goroutine leak check #115456
@@ -228,3 +233,10 @@ func StartTestServer(t testing.TB, setup TestServerSetup) (client.Interface, *re

	return kubeAPIServerClient, kubeAPIServerClientConfig, tearDownFn
}

type checkNothing struct{}
The alternative is (untested):

func init() {
	_ = healthz.LogHealthz.Check(nil)
}

That starts the goroutine during init, and then goleak.IgnoreCurrent knows about it.
I'm undecided whether that is a better solution than replacing the instance.
can't we plumb a context on the handler to stop it on apiserver shutdown?
it seems it doesn't work:
found unexpected goroutines:
[Goroutine 2263 in state select, with k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil on top of the stack:
goroutine 2263 [select]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x59f07c8?, {0x59be860, 0xc0092efbf0}, 0x1, 0xc0000925a0)
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:169 +0x135
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x59c0fa0?, 0xdf8475800, 0x0, 0x25?, 0xc00317ef00?)
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Forever(0x0?, 0x0?)
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:83 +0x28
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/healthz.(*log).Check.func1
/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/healthz/healthz.go:74 +0xe5
]
> can't we plumb a context on the handler to stop it on apiserver shutdown?

No, healthz.go initializes the variable during init. Goroutine startup then gets delayed until it is used for the first time.

> it seems it doesn't work

I'll check that tomorrow. So far, I was testing only with test/integration/scheduler_perf. make test-integration fails for me for various reasons.
We have multiple copies of StartTestServer. I modified the one in test/integration/framework/test_server.go, but not the one in cmd/kube-apiserver/app/testing/... let's do this all in etcd.go where the leak checking is done.
after := runtime.NumGoroutine()
stacktraces := make([]byte, 1<<20)
runtime.Stack(stacktraces, true)
klog.Fatalf("unexpected number of goroutines: before: %d after %d\n%sd", before, after, string(stacktraces))
Note that replacing this code also fixes a slight bug: the string that gets written here is 1<<20 bytes long. The code should have used the length returned by runtime.Stack:

n := runtime.Stack(stacktraces, true)
... string(stacktraces[:n]) ...
you are awesome 👏
this is more on the sig-testing side of things
Yes, and the change to
force-pushed from 90fddf0 to 5b23729
go.mod (outdated)

@@ -74,6 +74,7 @@ require (
	go.opentelemetry.io/otel/sdk v1.10.0
	go.opentelemetry.io/otel/trace v1.10.0
	go.opentelemetry.io/proto/otlp v0.19.0
	go.uber.org/goleak v1.2.0
I wouldn't want to add a dependency for this, especially not one that has any transitive dependencies
We do have some code we can reorganize into a package if needed:
kubernetes/test/integration/framework/etcd.go
Lines 185 to 215 in b166791
before := runtime.NumGoroutine()
stop, err := startEtcd()
if err != nil {
	klog.Fatalf("cannot run integration tests: unable to start etcd: %v", err)
}
result := tests()
stop() // Don't defer this. See os.Exit documentation.
checkNumberOfGoroutines := func() (bool, error) {
	// We leave some room for leaked goroutines as there are
	// still some leaks, mostly:
	// - leak from lumberjack package we're vendoring
	// - leak from apiserve healthz
	// - leak from opencensus library
	// Once fixed, we should be able to bring it down to zero.
	if dg := runtime.NumGoroutine() - before; dg <= 3 {
		return true, nil
	}
	// Allow goroutines to schedule and die off.
	runtime.Gosched()
	return false, nil
}
// It generally takes visibly less than 1s to finish all goroutines.
// But we keep the limit higher to account for cpu-starved environments.
if err := wait.Poll(100*time.Millisecond, 5*time.Second, checkNumberOfGoroutines); err != nil {
	after := runtime.NumGoroutine()
	stacktraces := make([]byte, 1<<20)
	runtime.Stack(stacktraces, true)
	klog.Fatalf("unexpected number of goroutines: before: %d after %d\n%sd", before, after, string(stacktraces))
}
> current method is simple and "good enough" for our purposes

It's not 😞

The current code doesn't support ignoring the goroutines that were started during init. That's crucial for reliably ignoring some of the goroutines that currently force us to do the fuzzy "3 goroutine leaks are okay" thing.

> We do have some code we can reorganize into a package if needed:

I had started improving that code, but then realized that I would end up re-implementing goleak. Is the problem with goleak the dependencies of the package? The actual code that we pull in is pretty simple. What about simply copying that code into test/utils/goleak?
check-dependency-stats doesn't look too bad to me:
Switched to a new branch 'base'
--- /logs/artifacts/stats-base.txt 2023-02-01 16:00:51.332458491 +0000
+++ /logs/artifacts/stats.txt 2023-02-01 16:00:50.944422049 +0000
@@ -1,6 +1,6 @@
Direct Dependencies: 210
-Transitive Dependencies: 235
-Total Dependencies: 285
+Transitive Dependencies: 236
+Total Dependencies: 286
Max Depth Of Dependencies: 23
All dependencies:
@@ -249,6 +249,7 @@
go.uber.org/multierr
go.uber.org/zap
golang.org/x/crypto
+golang.org/x/lint
golang.org/x/mod
golang.org/x/net
golang.org/x/oauth2
The integration tests are now passing, with the vague "allow 3 additional goroutines" replaced by more precise suppressions through goleak.Ignore*. I found that in two cases I needed goleak.IgnoreTopFunction. I'd prefer just using goleak as-is instead of investing further effort into improving our own code or forking goleak (regardless of whether that's under "third_party/forked" or "test").

@liggitt: WDYT, is golang.org/x/lint acceptable as a new additional dependency besides goleak itself? Or is this about other dependencies that don't show up in go.mod and check-dependency-stats?
> @liggitt: WDYT, is golang.org/x/lint acceptable as a new additional dependency besides goleak itself?

- golang.org/x/lint is marked deprecated and frozen... I don't have confidence it will be updated if required to continue working with new go versions, so I really don't want it in our graph (I opened Add golang.org/x/lint to unwanted dependencies #115488 to record that, since we're not currently depending on it). It looks like the leak library only depends on it because they explicitly take an _ import on it and run it in their make lint target. Perhaps an upstream issue/contribution to drop that and scope it down to go vet would be accepted?

> Or is this about other dependencies that don't show up in go.mod and check-dependency-stats?

- check-dependency-stats isn't surfacing it, but even though go.uber.org/goleak was in our graph before, it was unused... now it is actively used and vendored, so any changes / additions to its dependency graph impact us more directly. It looks like the only dependency it currently has (aside from lint) is testify, which we also have, so there's no current transitive dependency that gets pulled in that's problematic... that's more of an unknown/future growth concern.
- The last thought I have about this is that goroutine leak detection seems likely to need tweaking around go releases... depending on a library to do this seems like it will be slower to react if changes are needed around a new go version.
I opened uber-go/goleak#92 with a request / suggestion for dropping the golang.org/x/lint reference from the graph
I submitted a PR for goleak. Let's wait for that.
force-pushed from 5b23729 to 6074f9f
force-pushed from 5ea0042 to 7cff84f
test/integration/framework/etcd.go (outdated)

// There's no good way to ignore this particular loop.
// Matching against Sleep is only indirect and not very
// precise.
goleak.IgnoreTopFunction("time.Sleep"),
I'm not happy with this particular Ignore. However, that would mean adding some kind of forced shutdown to this loop in kubernetes/staging/src/k8s.io/client-go/tools/record/event.go (lines 215 to 231 in ae23b0c):
for {
	if recordEvent(sink, result.Event, result.Patch, result.Event.Count > 1, eventCorrelator) {
		break
	}
	tries++
	if tries >= maxTriesPerEvent {
		klog.Errorf("Unable to write event '%#v' (retry limit exceeded!)", event)
		break
	}
	// Randomize the first sleep so that various clients won't all be
	// synced up if the master goes down.
	if tries == 1 {
		time.Sleep(time.Duration(float64(sleepDuration) * rand.Float64()))
	} else {
		time.Sleep(sleepDuration)
	}
}
I'm open for suggestions...
add a context to recordToSink(ctx context.Context, sink EventSink, event *v1.Event, eventCorrelator *EventCorrelator, sleepDuration time.Duration) and cancel the context on Shutdown:

kubernetes/staging/src/k8s.io/client-go/tools/record/event.go, lines 198 to 200 in ae23b0c

func (e *eventBroadcasterImpl) Shutdown() {
	e.Broadcaster.Shutdown()
}
I am not sure whether that is possible. There are comments elsewhere (don't remember exactly where) that say that pending events will still get delivered after a shutdown, so the current behavior might be intentional. Changing it might break something?
Besides, adding a context to recordToSink then implies changing the StartEventWatcher API - let's not.

The problem is that the sink blocks when it should simply give up:

rsc.eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: rsc.kubeClient.CoreV1().Events("")})

There's a TODO from 2016 about EventSinkImpl:
> Besides, adding a context to recordToSink then implies changing the StartEventWatcher API - let's not.

Why? The context is internal and closed by Shutdown(); I didn't mean to expose it.

> that say that pending events will still get delivered after a shutdown

Events are ephemeral, but we can give some grace period and try to drain, WDYT?
eventBroadcasterImpl doesn't have a context that it could pass to recordToSink. I guess it would be possible to create one just for this purpose. Let me try this, and then we can ask others for their opinion whether it is a breaking change.
Pushed an update. What I meant with "change Watch API" is that the result of StartRecordingToSink (a watch.Interface) cannot be used to shut down the watch and the delivery callback. That is harder to change.
This seems to have worked. I've turned the change into a separate PR here: #115514
/retest
force-pushed from 8789f0d to ca25628
I rebased to get rid of the already merged EventBroadcaster.Shutdown commit. I also updated to goleak master because that removes the problematic go/lint dependency. I've asked for a proper release. Do we need to wait for that? I'd prefer to get this PR merged soonish because it blocks further work on
force-pushed from 38baf1e to a939e71
goleak v1.2.1 was released and this PR now uses it. @liggitt: is this okay now as a dependency? If this ever causes problems during a Go update, it'll be trivial to comment out and restore later. It's only useful in tests, so no production code could possibly depend on it.
It provides more readable output and has additional APIs for using it inside a unit test. goleak.IgnoreCurrent is needed to filter out the goroutine that gets started when importing go.opencensus.io/stats/view.

In order to handle background goroutines that get created on demand and cannot be stopped (like the one for LogzHealth), a helper function ensures that those are running before calling goleak.IgnoreCurrent. Keeping those goroutines running is not a problem and thus not worth the effort of adding new APIs to stop them.

Other goroutines are genuine leaks for which no fix is available. Those get suppressed via IgnoreTopFunction, which works as long as that function is unique enough.

Example output for the leak fixed in kubernetes#115423:

E0202 09:30:51.641841 74789 etcd.go:205] "EtcdMain goroutine check" err=<
	found unexpected goroutines:
	[Goroutine 4889 in state chan receive, with k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop on top of the stack:
	goroutine 4889 [chan receive]:
	k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop(0xc0076183c0)
		/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:268 +0x65
	created by k8s.io/apimachinery/pkg/watch.NewBroadcaster
		/nvme/gopath/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/watch/mux.go:77 +0x116
>
force-pushed from a939e71 to f131cab
yeah, this is more reasonable as a dependency now, thanks for helping clean it up

/lgtm
LGTM label has been added. Git tree hash: e293518a443aa90339a0a5551420998962b610a8
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
/triage accepted
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The current approach of ignoring a certain number of additional goroutines is too vague. During testing, we want to ensure that no additional goroutines at all get left behind by a test.

Some goroutines being started during init are okay, and uber-go/goleak can handle that, in contrast to the old in-tree code. For that reason, and because goleak has nicer output, it replaces the in-tree code.
Does this PR introduce a user-facing change?